Midterm

Author

Giuliet Kibler

Introduction

The National Health and Nutrition Examination Survey (NHANES) is a survey, typically conducted over a two-year cycle, that estimates the dietary intake of Americans 1 year or older over the 24-hour period prior to the interview. This particular dataset combines data collected in the 2017-2018 cycle with data from 2019 to March 2020, since the NHANES program was suspended in March 2020 due to the COVID-19 pandemic. The dietary interview component of this survey is called “What We Eat in America” (WWEIA), and its data are collected using the USDA’s Automated Multiple Pass Method (AMPM). All participants are eligible for two survey interviews: the first is recorded in person at the Mobile Examination Center, and the second is conducted over the phone 3 to 10 days later. This dataset includes dietary information from the first interview and is a log of the total energy and nutrient intakes from foods and beverages within the previous 24 hours.

Of particular interest in this dataset is the relationship between pre-pandemic macronutrient dietary intention and true intake. Additionally, there is interest in assessing the relationship between dietary intake and B vitamin intake. B vitamins are cofactors for many cellular pathways, including cellular metabolism and synthesis of DNA and RNA, but are not stored by the body, so it is critical to replenish them daily through foods and supplements (Hanna et al., 2022). Therefore, this analysis assesses whether Americans 1 year or older were eating their intended macronutrient diet pre-pandemic and whether their intake was associated with B vitamin intake.

Methods

The P_DR1TOT dataset for 2017-March 2020 was downloaded from the CDC’s NHANES records of dietary data. This is a dataset from the WWEIA day 1 interviews, conducted pre-pandemic, and includes total dietary intake of participants.

Variables of interest include 6 special-diet variables, referred to here as intended diet; 5 energy (caloric) and macronutrient variables; and vitamins B1, B2, and B6. These variables were extracted from the dataset and relabeled to be more informative. The survey records 12 special diets in separate variables, each coded with the diet’s number (1-12) for yes or missing for no; the diet variables used here were recoded to 1 for yes and 0 for no. Since low calorie and high calorie diets are recorded separately, a new diet variable was created in which low calorie is 0, high calorie is 1, and neither is 2.
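
The recoding described above can be sketched as follows. This is a minimal illustration on made-up values, not the code used for this report: it assumes that `DRQSDT1` flags a low calorie diet and `DRQSDT8` a high calorie diet (check the NHANES codebook before reusing this mapping), and the helper `recode_flag` is a hypothetical name. It is written in base R so it runs without extra packages.

```r
# Toy stand-in for the raw diet columns (values are made up for illustration).
# ASSUMPTION: DRQSDT1 = low calorie diet, DRQSDT8 = high calorie diet;
# confirm this mapping against the NHANES codebook.
raw <- data.frame(
  DRQSDT1 = c(1, NA, NA, 1),
  DRQSDT8 = c(NA, 8, NA, NA)
)

# Recode "diet number or missing" to 1 (yes) / 0 (no).
# recode_flag is a hypothetical helper, not part of any package.
recode_flag <- function(x) ifelse(is.na(x), 0, 1)
raw$Low_Calorie  <- recode_flag(raw$DRQSDT1)
raw$High_Calorie <- recode_flag(raw$DRQSDT8)

# Combine into one variable: 0 = low calorie, 1 = high calorie, 2 = neither.
# If both flags were set, this sketch would assign 0 (low calorie wins).
raw$Calorie_diet <- ifelse(raw$Low_Calorie == 1, 0,
                           ifelse(raw$High_Calorie == 1, 1, 2))

print(raw)
```

How a participant reporting both diets should be coded is a design choice that the text does not specify; the sketch simply lets the low calorie flag take precedence.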

The association between intended diet and dietary intake was assessed using summary statistics and box plots.
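
As a rough sketch of this kind of grouped summary, the base R example below computes the mean, median, quartiles, and count of an intake variable by diet flag, and then checks whether the two interquartile ranges overlap. The data frame `toy` and its values are invented for illustration, and the column names `Low_Sugar` and `Sugar` only mirror the relabeled variables used in this report.

```r
# Toy data (made up): sugar intake in grams by low-sugar diet flag
toy <- data.frame(
  Low_Sugar = c(0, 0, 0, 1, 1, 1),
  Sugar     = c(80, 100, 120, 40, 60, 90)
)

# One summary row per diet group
summary_by_diet <- do.call(rbind, lapply(split(toy, toy$Low_Sugar), function(g) {
  data.frame(
    Low_Sugar = g$Low_Sugar[1],
    Mean      = mean(g$Sugar, na.rm = TRUE),
    Median    = median(g$Sugar, na.rm = TRUE),
    Q1        = quantile(g$Sugar, 0.25, na.rm = TRUE, names = FALSE),
    Q3        = quantile(g$Sugar, 0.75, na.rm = TRUE, names = FALSE),
    Count     = nrow(g)
  )
}))
print(summary_by_diet)

# Two intervals [Q1, Q3] overlap when the larger Q1 is at most the smaller Q3
iqr_overlap <- max(summary_by_diet$Q1) <= min(summary_by_diet$Q3)
print(iqr_overlap)
```

Reporting the quartiles alongside the median makes it easier to judge whether a difference in medians is large relative to the spread within each group.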

The proportion of participants below the recommended B vitamin intake levels for men was reported. The association between dietary intake, as well as caloric diet type, and B vitamin intake was assessed using scatter plots and fitted linear models.
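
One way such proportions could be computed is sketched below in base R. The intake values are made up, the thresholds are the ones hard-coded in the plotting function later in this report (B1 = 1.2, B2 = 1.3, B6 = 1), and dropping missing values from the denominator is an assumption, since the report does not state how missing intakes were handled.

```r
# Toy intake values (made up for illustration)
toy <- data.frame(
  B1 = c(0.9, 1.5, 1.1, 2.0, NA),
  B2 = c(1.0, 1.4, 2.2, 1.3, 1.8),
  B6 = c(0.8, 1.2, 3.0, 0.9, 1.1)
)

# Thresholds copied from the plotting code in this report
recommended <- c(B1 = 1.2, B2 = 1.3, B6 = 1)

# Share of non-missing participants strictly below each threshold.
# ASSUMPTION: missing intakes are excluded from the denominator.
prop_below <- sapply(names(recommended), function(v) {
  mean(toy[[v]] < recommended[[v]], na.rm = TRUE)
})

result <- data.frame(
  Vitamin          = names(prop_below),
  Proportion_Below = unname(prop_below)
)
print(result)
```

Because `mean()` of a logical vector is the fraction of `TRUE` values, no explicit counting is needed.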

Data Wrangling & EDA

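# Load packages: haven for read_xpt(), dplyr for data wrangling, ggplot2 for plots
library(haven)
library(dplyr)
library(ggplot2)

# Read in table and check
data <- read_xpt("P_DR1TOT.XPT")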
head(data)
# A tibble: 6 × 168
    SEQN WTDRD1PP WTDR2DPP DR1DRSTZ DR1EXMER DRABF DRDINT DR1DBIH DR1DAY DR1LANG
   <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
1 109263    7619.   17808.        1       14     2      2       4      6       1
2 109264    8236.    7254.        1       81     2      2       5      6       1
3 109265   33535.   35612.        1       88     2      2      19      4       1
4 109266    6831.    5988.        1       81     2      2       4      7       1
5 109269    7876.   18232.        1       88     2      2       9      1       1
6 109270   11254.   14178.        1       81     2      2       3      4       1
# ℹ 158 more variables: DR1MRESP <dbl>, DR1HELP <dbl>, DBQ095Z <dbl>,
#   DBD100 <dbl>, DRQSPREP <dbl>, DR1STY <dbl>, DR1SKY <dbl>, DRQSDIET <dbl>,
#   DRQSDT1 <dbl>, DRQSDT2 <dbl>, DRQSDT3 <dbl>, DRQSDT4 <dbl>, DRQSDT5 <dbl>,
#   DRQSDT6 <dbl>, DRQSDT7 <dbl>, DRQSDT8 <dbl>, DRQSDT9 <dbl>, DRQSDT10 <dbl>,
#   DRQSDT11 <dbl>, DRQSDT12 <dbl>, DRQSDT91 <dbl>, DR1TNUMF <dbl>,
#   DR1TKCAL <dbl>, DR1TPROT <dbl>, DR1TCARB <dbl>, DR1TSUGR <dbl>,
#   DR1TFIBE <dbl>, DR1TTFAT <dbl>, DR1TSFAT <dbl>, DR1TMFAT <dbl>, …
tail(data)
# A tibble: 6 × 168
    SEQN WTDRD1PP WTDR2DPP DR1DRSTZ DR1EXMER DRABF DRDINT DR1DBIH DR1DAY DR1LANG
   <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
1 124817   14850.       0         1       88     2      1      -1      2       2
2 124818   15462.   12437.        1       49     2      2       6      7       1
3 124819    4092.    4101.        1       14     2      2      22      4       1
4 124820   34358.   35052.        1       86     2      2      18      4       1
5 124821    3044.       0         1       81     2      1       0      7       1
6 124822       0       NA         2       14    NA     NA      NA      7       2
# ℹ 158 more variables: DR1MRESP <dbl>, DR1HELP <dbl>, DBQ095Z <dbl>,
#   DBD100 <dbl>, DRQSPREP <dbl>, DR1STY <dbl>, DR1SKY <dbl>, DRQSDIET <dbl>,
#   DRQSDT1 <dbl>, DRQSDT2 <dbl>, DRQSDT3 <dbl>, DRQSDT4 <dbl>, DRQSDT5 <dbl>,
#   DRQSDT6 <dbl>, DRQSDT7 <dbl>, DRQSDT8 <dbl>, DRQSDT9 <dbl>, DRQSDT10 <dbl>,
#   DRQSDT11 <dbl>, DRQSDT12 <dbl>, DRQSDT91 <dbl>, DR1TNUMF <dbl>,
#   DR1TKCAL <dbl>, DR1TPROT <dbl>, DR1TCARB <dbl>, DR1TSUGR <dbl>,
#   DR1TFIBE <dbl>, DR1TTFAT <dbl>, DR1TSFAT <dbl>, DR1TMFAT <dbl>, …
dim(data)
[1] 14300   168
sapply(data, class)
     SEQN  WTDRD1PP  WTDR2DPP  DR1DRSTZ  DR1EXMER     DRABF    DRDINT   DR1DBIH 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
   DR1DAY   DR1LANG  DR1MRESP   DR1HELP   DBQ095Z    DBD100  DRQSPREP    DR1STY 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
   DR1SKY  DRQSDIET   DRQSDT1   DRQSDT2   DRQSDT3   DRQSDT4   DRQSDT5   DRQSDT6 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
  DRQSDT7   DRQSDT8   DRQSDT9  DRQSDT10  DRQSDT11  DRQSDT12  DRQSDT91  DR1TNUMF 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DR1TKCAL  DR1TPROT  DR1TCARB  DR1TSUGR  DR1TFIBE  DR1TTFAT  DR1TSFAT  DR1TMFAT 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DR1TPFAT  DR1TCHOL  DR1TATOC  DR1TATOA   DR1TRET  DR1TVARA  DR1TACAR  DR1TBCAR 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DR1TCRYP  DR1TLYCO    DR1TLZ   DR1TVB1   DR1TVB2  DR1TNIAC   DR1TVB6  DR1TFOLA 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
   DR1TFA    DR1TFF  DR1TFDFE   DR1TCHL  DR1TVB12  DR1TB12A    DR1TVC    DR1TVD 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
   DR1TVK  DR1TCALC  DR1TPHOS  DR1TMAGN  DR1TIRON  DR1TZINC  DR1TCOPP  DR1TSODI 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DR1TPOTA  DR1TSELE  DR1TCAFF  DR1TTHEO  DR1TALCO  DR1TMOIS  DR1TS040  DR1TS060 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DR1TS080  DR1TS100  DR1TS120  DR1TS140  DR1TS160  DR1TS180  DR1TM161  DR1TM181 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DR1TM201  DR1TM221  DR1TP182  DR1TP183  DR1TP184  DR1TP204  DR1TP205  DR1TP225 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DR1TP226   DR1_300  DR1_320Z  DR1_330Z  DR1BWATZ   DR1TWSZ    DRD340   DRD350A 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DRD350AQ   DRD350B  DRD350BQ   DRD350C  DRD350CQ   DRD350D  DRD350DQ   DRD350E 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DRD350EQ   DRD350F  DRD350FQ   DRD350G  DRD350GQ   DRD350H  DRD350HQ   DRD350I 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DRD350IQ   DRD350J  DRD350JQ   DRD350K    DRD360   DRD370A  DRD370AQ   DRD370B 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DRD370BQ   DRD370C  DRD370CQ   DRD370D  DRD370DQ   DRD370E  DRD370EQ   DRD370F 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DRD370FQ   DRD370G  DRD370GQ   DRD370H  DRD370HQ   DRD370I  DRD370IQ   DRD370J 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DRD370JQ   DRD370K  DRD370KQ   DRD370L  DRD370LQ   DRD370M  DRD370MQ   DRD370N 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DRD370NQ   DRD370O  DRD370OQ   DRD370P  DRD370PQ   DRD370Q  DRD370QQ   DRD370R 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
 DRD370RQ   DRD370S  DRD370SQ   DRD370T  DRD370TQ   DRD370U  DRD370UQ   DRD370V 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
# Compute the overall median of each intake variable of interest
data_avgs <- data |>
  summarise(
    median_cal = median(DR1TKCAL, na.rm = TRUE),
    median_sug = median(DR1TSUGR, na.rm = TRUE),
    median_carb = median(DR1TCARB, na.rm = TRUE),
    median_fat = median(DR1TTFAT, na.rm = TRUE),
    median_pro = median(DR1TPROT, na.rm = TRUE),
    median_B1 = median(DR1TVB1, na.rm = TRUE),
    median_B2 = median(DR1TVB2, na.rm = TRUE),
    median_B6 = median(DR1TVB6, na.rm = TRUE))
print(data_avgs)
# A tibble: 1 × 8
  median_cal median_sug median_carb median_fat median_pro median_B1 median_B2
       <dbl>      <dbl>       <dbl>      <dbl>      <dbl>     <dbl>     <dbl>
1       1824       90.8        219.       71.8       64.8      1.31      1.59
# ℹ 1 more variable: median_B6 <dbl>
# Create categorical variables
high_low <- data |>
  mutate(
    cat_Calories = ifelse(DR1TKCAL > data_avgs$median_cal, 1, 0),
    cat_Sugar = ifelse(DR1TSUGR > data_avgs$median_sug, 1, 0),
    cat_Carbohydrate = ifelse(DR1TCARB > data_avgs$median_carb, 1, 0),
    cat_Fat = ifelse(DR1TTFAT > data_avgs$median_fat, 1, 0),
    cat_Protein = ifelse(DR1TPROT > data_avgs$median_pro, 1, 0)
  ) |>
  select(starts_with("cat_"))  # Keep only the categorical columns
print(high_low)
# A tibble: 14,300 × 5
   cat_Calories cat_Sugar cat_Carbohydrate cat_Fat cat_Protein
          <dbl>     <dbl>            <dbl>   <dbl>       <dbl>
 1            0         0                0       0           0
 2            0         0                0       0           0
 3            1         1                1       1           0
 4            0         1                0       1           0
 5            0         0                0       0           0
 6            1         1                1       0           0
 7            1         0                0       1           1
 8           NA        NA               NA      NA          NA
 9            0         1                1       0           0
10            1         0                1       1           1
# ℹ 14,290 more rows
[1] "Head of Categorization of Diet Type (0 = Low Calorie, 1 = High Calorie and 2 = No Caloric Diet)"
[1] 2 2 2 0 2 2
# Summary of Extracted Data
head(extracted_data)
# A tibble: 6 × 15
  Calories Low_Calorie High_Calorie Sugar Low_Sugar Carbohydrate
     <dbl>       <dbl>        <dbl> <dbl>     <dbl>        <dbl>
1     1402           0            0  73.4         0         188.
2     1046           0            0  27.9         0         122.
3     1926           0            0 157.          0         247.
4     1698           1            0  94.2         0         218.
5     1251           0            0  84.8         0         160.
6     1973           0            0 134.          0         273.
# ℹ 9 more variables: Low_Carbohydrate <dbl>, Fat <dbl>, Low_Fat <dbl>,
#   Protein <dbl>, High_Protein <dbl>, B1 <dbl>, B2 <dbl>, B6 <dbl>,
#   Calorie_diet <dbl>
tail(extracted_data)
# A tibble: 6 × 15
  Calories Low_Calorie High_Calorie Sugar Low_Sugar Carbohydrate
     <dbl>       <dbl>        <dbl> <dbl>     <dbl>        <dbl>
1     1131           0            0  51.0         0         90.3
2     3868           0            0 279.          0        512. 
3     1749           0            0 108.          0        197. 
4     1204           0            0  65.8         0        158. 
5     1698           0            0  50.6         0        111. 
6       NA           0            0  NA           0         NA  
# ℹ 9 more variables: Low_Carbohydrate <dbl>, Fat <dbl>, Low_Fat <dbl>,
#   Protein <dbl>, High_Protein <dbl>, B1 <dbl>, B2 <dbl>, B6 <dbl>,
#   Calorie_diet <dbl>
dim(extracted_data)
[1] 14300    15
sapply(extracted_data, class)
        Calories      Low_Calorie     High_Calorie            Sugar 
       "numeric"        "numeric"        "numeric"        "numeric" 
       Low_Sugar     Carbohydrate Low_Carbohydrate              Fat 
       "numeric"        "numeric"        "numeric"        "numeric" 
         Low_Fat          Protein     High_Protein               B1 
       "numeric"        "numeric"        "numeric"        "numeric" 
              B2               B6     Calorie_diet 
       "numeric"        "numeric"        "numeric" 

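Are the intended macronutrient diets consistent with the median-based categorical variables? Ideally, every participant flagged as on a diet (value 1) would fall in the intake category that matches that diet.

Compared to the overall median, some people on a macronutrient diet are not eating in line with their diet plan; this is investigated further below.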
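
A comparison of this kind could be made with a simple cross-tabulation. The base R sketch below uses made-up protein values; `High_Protein` and `cat_Protein` mirror variable names from this report, but the data and the summary `share_consistent` are purely illustrative.

```r
# Toy data (made up): protein intake in grams and intended high-protein diet
toy <- data.frame(
  Protein      = c(40, 55, 70, 90, 110, 65),
  High_Protein = c(0, 0, 0, 1, 1, 1)
)

# Median split: 1 = above the overall median, 0 = at or below it
median_pro <- median(toy$Protein, na.rm = TRUE)
toy$cat_Protein <- ifelse(toy$Protein > median_pro, 1, 0)

# Rows: intended diet flag; columns: above-median flag
agreement <- table(Intended = toy$High_Protein, Above_Median = toy$cat_Protein)
print(agreement)

# Share of intended high-protein dieters who are actually above the median
share_consistent <- mean(toy$cat_Protein[toy$High_Protein == 1] == 1, na.rm = TRUE)
print(share_consistent)
```

For the "low" diets the matching category would be 0 rather than 1, so the comparison in the last step would be flipped accordingly.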

Preliminary Results

Investigate correlation between intended diet and dietary intake

[1] "Summary of Sugar Intake"


|Low_Sugar |      Mean| Median|  Min|    Max| Count|
|:---------|---------:|------:|----:|------:|-----:|
|0         | 105.68458|  90.89| 0.00| 931.16| 14226|
|1         |  81.42301|  62.24| 5.16| 414.14|    74|
[1] "Summary of Carbohydrate Intake"


|Low_Carbohydrate |     Mean|  Median|  Min|     Max| Count|
|:----------------|--------:|-------:|----:|-------:|-----:|
|0                | 239.5814| 219.080| 0.00| 1586.24| 14143|
|1                | 178.5371| 165.395| 9.55|  673.39|   157|
[1] "Summary of Fat Intake"


|Low_Fat |     Mean| Median|  Min|    Max| Count|
|:-------|--------:|------:|----:|------:|-----:|
|0       | 81.03713| 71.805| 0.00| 567.96| 14154|
|1       | 80.43219| 72.620| 5.72| 253.70|   146|
[1] "Summary of Protein Intake"


|High_Protein |      Mean| Median|   Min|    Max| Count|
|:------------|---------:|------:|-----:|------:|-----:|
|0            |  72.16154|  64.72|  0.00| 545.20| 14261|
|1            | 108.33923| 100.61| 15.03| 309.18|    39|

Summary Table for Caloric Diets



| Calorie_diet|     Mean| Median| Min|   Max| Count|
|------------:|--------:|------:|---:|-----:|-----:|
|            0| 1968.596|   1834| 100|  7375|   799|
|            1| 2669.773|   2528| 553|  7632|    46|
|            2| 1995.732|   1821|   0| 12501| 13455|

Investigate correlations between dietary intake and B vitamins

Proportion of Participants Below B Vitamin Recommended Intake for Men (Hanna et al., 2022)

  Vitamin Proportion_Below
1      B1        0.4317301
2      B2        0.3513557
3      B6        0.2571821

Graphing B Vitamins vs Dietary Intake

# One plot is drawn per nutrient-vitamin pair because separate plots are easier to read
create_scatter_plots <- function(data) {
  # List of nutrient variables to plot
  nutrient_vars <- c("Calories", "Carbohydrate", "Fat", "Protein")
  # List of vitamin variables to plot against
  vitamin_vars <- c("B1", "B2", "B6")
  
  # Define recommended values for B vitamins
  recommended_values <- list(
    B1 = 1.2,
    B2 = 1.3,
    B6 = 1
  )
  
  # Loop through each nutrient variable
  for (nutrient in nutrient_vars) {
    # Check if the nutrient exists in the data
    if (!nutrient %in% colnames(data)) {
      message(paste("Nutrient variable", nutrient, "not found in the data. Skipping."))
      next
    }
    
    # Loop through each vitamin variable
    for (vitamin in vitamin_vars) {
      # Check if vitamin exists in the data
      if (vitamin %in% colnames(data)) {
        # Prepare the data for modeling
        temp_data <- data[, c(nutrient, vitamin)]
        colnames(temp_data) <- c("Nutrient", "Vitamin")
        
        # Filter out missing and non-finite values
        temp_data <- temp_data |>
          na.omit() |>
          filter(is.finite(Nutrient) & is.finite(Vitamin))
        
        # Check if there are enough data points for modeling
        if (nrow(temp_data) > 1) {
          # Calculate the linear model and R^2
          model <- lm(Vitamin ~ Nutrient, data = temp_data)
          r_squared <- summary(model)$r.squared
          
          # Create the scatter plot with line of best fit
          scatter_plot <- ggplot(temp_data, aes(x = Nutrient, y = Vitamin)) +
            geom_point() +
            geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Line of best fit
            labs(x = nutrient, y = vitamin) +
            theme_minimal() +
            ggtitle(paste("Scatter Plot of", vitamin, "vs", nutrient, "\nR² =", round(r_squared, 3)))
          
          # Add horizontal lines for recommended values
          scatter_plot <- scatter_plot +
            geom_hline(yintercept = recommended_values[[vitamin]], linetype = "dashed", color = "red", linewidth = 0.7) +
            annotate("text", x = max(temp_data$Nutrient, na.rm = TRUE), y = recommended_values[[vitamin]],
                     label = "Men's Recommended Dietary Intake", hjust = 1, vjust = -0.5, color = "red")
          
          # Print the scatter plot
          print(scatter_plot)
        } else {
          message(paste("Not enough data points for", vitamin, "vs", nutrient))
        }
      } else {
        message(paste("Vitamin variable", vitamin, "not found in the data. Skipping."))
      }
    }
  }
}

# Call the function with the data
create_scatter_plots(extracted_data)
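
To help interpret the R² shown in each plot title: for a linear model with a single predictor, R² equals the square of the Pearson correlation between the two variables. The base R sketch below checks this on made-up values; the numbers are illustrative only and are not taken from the NHANES data.

```r
# Toy data (made up): calories and vitamin B1 intake
toy <- data.frame(
  Calories = c(1200, 1500, 1800, 2100, 2400, 3000),
  B1       = c(0.8, 1.1, 1.0, 1.6, 1.5, 2.2)
)

# Same model form as in create_scatter_plots(): vitamin regressed on nutrient
model <- lm(B1 ~ Calories, data = toy)
r_squared <- summary(model)$r.squared

# Pearson correlation between the two variables
r <- cor(toy$Calories, toy$B1)

# For simple linear regression these agree up to floating-point error
print(all.equal(r_squared, r^2))
```

This means the R² values in the titles can be read as squared correlations, so an R² of 0.25 would correspond to a correlation of magnitude 0.5.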

Caloric Diet’s Effect on B vitamins

Conclusion

Overall, participants’ intended diets were, on average, associated with their true pre-pandemic intake. All intake distributions are right-skewed by high outliers, so medians were compared rather than means. The low sugar group had a lower median total sugar intake than those not on the diet (62.24 vs 90.89 g). Additionally, the low carbohydrate group ate fewer carbohydrates than those not on this diet (165.395 vs 219.08 g). On the other hand, those on a low fat diet ate slightly more fat than those not on the diet (72.62 vs 71.805 g), meaning the typical participant on a low fat diet did not eat less fat than other participants. The high protein group had a substantially higher median protein intake than those not on the diet (100.61 vs 64.72 g). Finally, the high calorie group had a substantially higher median caloric intake than either the low calorie group or those not on a caloric diet (2528 vs 1834 and 1821 kcal), but the low calorie group’s median was slightly higher than that of those not on a caloric diet (1834 vs 1821 kcal), meaning the typical participant on a low calorie diet ate more calories than those not intending to restrict calories. However, the IQRs of dietary intake overlap across all intended-diet groups, so how closely Americans followed an intended diet pre-pandemic was variable. This lack of a conclusive pattern makes sense considering that dietary needs are relative to a person’s physiological demands.

B vitamins have a moderate association with dietary intake. B1 and B2 are more strongly associated with all of the macronutrient intakes than B6, with the highest correlations occurring between B1 and caloric intake and between B1 and carbohydrate intake. This suggests that getting enough dietary nutrition is critical for daily replenishment of B1 and B2. Interestingly, the association between caloric diet type and the B vitamins was not consistent across the vitamins, indicating that more than dietary intention alone is necessary for sufficient B vitamin intake. With 43% of participants below the recommended intake for men for B1, 35% below it for B2, and 26% below it for B6, many participants may need to eat more macronutrients and overall calories to meet the body’s B vitamin demands. In conclusion, the average American 1 year or older is eating their intended macronutrient diet, and their intake is moderately associated with B vitamin intake pre-pandemic.